Hello! My name is Hira Anees Awan. I am from Pakistan. Currently, I am pursuing a Master’s degree in Biostatistics at Duke University. I recently joined the Tomaras Lab to work with Cesar Lopez. I love teaching and computer programming, so this workshop is pretty much the best-case scenario for me.
You can email me at ha96@duke.edu for any queries or input. I would love your feedback on this workshop.
You may be wondering what this beautiful notebook-like thing is, where I can write paragraphs, insert chunks of code, and run them like a script. Notice the extension of this file. It is ‘.Rmd’. This means it is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Here, I am providing a link to a brief R Markdown cheatsheet. You can do all kinds of cool stuff with R Markdown. Teaching R Markdown is not today’s main agenda, but I think it is a great tool to introduce to beginners who like to document every single line of code. And it is fun too.
Let’s add a code chunk and try to run it in R.
2+2
[1] 4
Voilà! Two plus two equals four. Great! Now that everything is up and running, I will explain what we are trying to do today.
The main agenda of today’s workshop is to get you acquainted with the basics of Data Science using R. I must add that we are going to focus on small datasets. I believe a beginner’s workshop should focus on smaller datasets that are easy to manipulate and visualize.
At the conclusion of this workshop, you will be able:
You can load almost any kind of file in R. We will focus on widely used formats: tab-delimited text files and CSV files.
‘\t’ specifies that the file is tab-delimited.
head shows the first six rows of a data frame
TextFile_df <- read.table("Tab_Delimited_Text_File.txt", header = TRUE, sep = "\t")
head(TextFile_df)
‘header=TRUE’ means that the names of the columns are included in the text file
tail shows the last six rows of a data frame
train_df <- read.csv("train.csv",header = TRUE)
test_df <- read.csv("test.csv",header = TRUE)
head(train_df)
tail(test_df)
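As a self-contained sketch of the two loading calls above, the following writes a tiny made-up table to temporary files and reads it back. The data frame `toy_df` and the temporary file paths are hypothetical stand-ins, not the workshop files.

```r
# Hypothetical toy data; not the workshop's train.csv or text file.
toy_df <- data.frame(id = 1:3, group = c("a", "b", "c"))

# Write the same table once as tab-delimited text and once as CSV.
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")
write.table(toy_df, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(toy_df, csv_path, row.names = FALSE)

# header = TRUE: the first line holds the column names.
# sep = "\t": the columns are separated by tabs.
from_txt <- read.table(txt_path, header = TRUE, sep = "\t")
from_csv <- read.csv(csv_path, header = TRUE)

head(from_txt)
head(from_csv)
```

Both calls should give back a data frame with the same column names and number of rows as `toy_df`.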
Here, I would like to stop and emphasize some datatypes in R that you will come across very often.
Vector: It holds one datatype only. If you mix datatypes, R coerces all the elements to a single common datatype. For example, combining text and a number turns the number into text.
vector_1 <- c('Biostatistics','Electrical Engineering','Mechanical Engineering')
print(vector_1)
[1] "Biostatistics" "Electrical Engineering" "Mechanical Engineering"
vector_2 <- c('Covid', 10+9)
print(vector_1[1])
[1] "Biostatistics"
print(vector_2[2])
[1] "19"
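The coercion described above can be checked with `class()`. This is a minimal sketch; the variable names `v_num`, `v_lgl`, and `v_chr` are made up for illustration.

```r
# c() coerces mixed inputs to one common type:
# logical -> integer -> double -> character.
v_num <- c(1L, 2.5)          # integer + double
v_lgl <- c(TRUE, 3)          # logical + double
v_chr <- c('Covid', 10 + 9)  # character + double

class(v_num)
class(v_lgl)
class(v_chr)

# Converting back to a number has to be done explicitly.
as.numeric(v_chr[2])
```

This is why `vector_2[2]` prints as the text "19" rather than the number 19.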
List: It can have different datatypes.
# Create a list.
list_1 <- list(c(1,2,3),45,'Quarantine')
print(list_1)
[[1]]
[1] 1 2 3
[[2]]
[1] 45
[[3]]
[1] "Quarantine"
print(list_1[[1]])
[1] 1 2 3
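A short sketch of the difference between single and double brackets on a list. The names `as_sublist`, `as_element`, and `named_list` are hypothetical and only used here.

```r
list_1 <- list(c(1, 2, 3), 45, 'Quarantine')

as_sublist <- list_1[1]    # single brackets: a list of length 1
as_element <- list_1[[1]]  # double brackets: the element itself

class(as_sublist)
class(as_element)

# Elements can also be named and then accessed with $.
named_list <- list(numbers = c(1, 2, 3), count = 45, word = 'Quarantine')
named_list$word
```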
Matrices: A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
matrix_1 = matrix( c('h','i','r','a','9',5), nrow = 2, ncol = 3, byrow = TRUE)
print(matrix_1)
[,1] [,2] [,3]
[1,] "h" "i" "r"
[2,] "a" "9" "5"
print(matrix_1[1,])
[1] "h" "i" "r"
print(matrix_1[,2])
[1] "i" "9"
print(matrix_1[1,2])
[1] "i"
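To see what `byrow = TRUE` changes, here is a small sketch with numbers instead of letters. The names `by_row` and `by_col` are made up for this example.

```r
values <- 1:6

# byrow = TRUE fills the matrix one row at a time.
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)

# The default (byrow = FALSE) fills it one column at a time.
by_col <- matrix(values, nrow = 2, ncol = 3)

print(by_row)
print(by_col)
dim(by_row)  # number of rows, number of columns
```

Like a vector, a matrix holds a single datatype, which is why the number 5 in `matrix_1` is printed as the text "5".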
Arrays: While matrices are confined to two dimensions, arrays can have any number of dimensions. The array function takes a dim argument that sets the size of each dimension. In the example below, we create an array made up of four 3x3 matrices.
# Create an array.
array_1 <- array(c('orange','green'),dim = c(3,3,4))
print(array_1)
, , 1
[,1] [,2] [,3]
[1,] "orange" "green" "orange"
[2,] "green" "orange" "green"
[3,] "orange" "green" "orange"
, , 2
[,1] [,2] [,3]
[1,] "green" "orange" "green"
[2,] "orange" "green" "orange"
[3,] "green" "orange" "green"
, , 3
[,1] [,2] [,3]
[1,] "orange" "green" "orange"
[2,] "green" "orange" "green"
[3,] "orange" "green" "orange"
, , 4
[,1] [,2] [,3]
[1,] "green" "orange" "green"
[2,] "orange" "green" "orange"
[3,] "green" "orange" "green"
#first matrix
print(array_1[,,1])
[,1] [,2] [,3]
[1,] "orange" "green" "orange"
[2,] "green" "orange" "green"
[3,] "orange" "green" "orange"
#third column of first matrix
print(array_1[,3,1])
[1] "orange" "green" "orange"
#third row of second matrix
print(array_1[3,,2])
[1] "green" "orange" "green"
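A quick way to confirm the shape of the array is to ask for its dimensions. This sketch rebuilds the same array so that it runs on its own.

```r
array_1 <- array(c('orange', 'green'), dim = c(3, 3, 4))

dim(array_1)     # rows, columns, number of matrices
length(array_1)  # total number of cells: 3 * 3 * 4

# The two input values are recycled until every cell is filled.
table(array_1)
```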
Factors: They store categorical data as a fixed set of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_1 <- factor(apple_colors)
# Print the factor.
print(factor_1)
[1] green green yellow red red red green
Levels: green red yellow
print(nlevels(factor_1))
[1] 3
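Under the hood, a factor stores integer codes that point into its levels. A minimal sketch follows; `factor_2` and its level order are assumptions added for illustration.

```r
apple_colors <- c('green', 'green', 'yellow', 'red', 'red', 'red', 'green')
factor_1 <- factor(apple_colors)

levels(factor_1)      # sorted by default
as.integer(factor_1)  # integer codes that index into the levels
table(factor_1)       # count per level

# The order of the levels can also be set explicitly.
factor_2 <- factor(apple_colors, levels = c('red', 'yellow', 'green'))
levels(factor_2)
```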
Let’s get back on track and work towards step 2 that is…
Selecting columns of a dataframe using dplyr
library(dplyr)
package ‘dplyr’ was built under R version 3.6.3
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
somecolumns_1 <- select(train_df, Name, Sex, Age)
head(somecolumns_1)
# keep all variables between Parch and Cabin, inclusive
somecolumns_2 <- select(train_df,Parch:Cabin)
head(somecolumns_2)
# keep all variables except Embarked
somecolumns_3 <- select(train_df, -Embarked)
head(somecolumns_3)
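The three ways of selecting columns can be tried on a tiny data frame. This is a sketch that assumes dplyr is installed; `toy_df` is hypothetical data that only borrows a few Titanic-like column names.

```r
library(dplyr)

# Hypothetical toy data, not the workshop's train.csv.
toy_df <- data.frame(
  Name = c("A", "B", "C"),
  Sex  = c("female", "male", "female"),
  Age  = c(30, 41, NA),
  Fare = c(80, 7.25, 26)
)

by_name  <- select(toy_df, Name, Age)  # pick columns by name
by_range <- select(toy_df, Sex:Fare)   # a range of adjacent columns
by_drop  <- select(toy_df, -Fare)      # everything except Fare

names(by_name)
names(by_range)
names(by_drop)
```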
Selecting rows of a dataframe using dplyr
library(dplyr)
somerows_1 <- filter(train_df,
Sex == "female")
head(somerows_1)
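filter can also take more than one condition. The sketch below assumes dplyr is installed and uses a made-up `toy_df`.

```r
library(dplyr)

# Hypothetical toy data, not the workshop's train.csv.
toy_df <- data.frame(
  Sex    = c("female", "male", "female", "female"),
  Pclass = c(1, 3, 3, 1),
  Age    = c(30, 41, NA, 12)
)

# Conditions separated by commas are combined with AND.
first_class_women <- filter(toy_df, Sex == "female", Pclass == 1)

# | means OR. Rows where the condition evaluates to NA are dropped.
young_or_male <- filter(toy_df, Age < 18 | Sex == "male")

nrow(first_class_women)
nrow(young_or_male)
```

Note that the third row has a missing Age and is not male, so its condition is NA and the row is left out.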
Mutating a dataframe using dplyr
# Change the Fare variable by multiplying each entry by 1.1
mutated_1 <- mutate(train_df, Fare = Fare * 1.1)
# Create a new variable called FareCategorical that is 'low' when Fare < 70 and 'high' otherwise
mutated_2 <- mutate(train_df,
FareCategorical = ifelse(Fare < 70, 'low','high'))
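Here is a small sketch of how mutate and ifelse treat boundary and missing values. It assumes dplyr is installed; `toy_df` and the column `FareScaled` are hypothetical.

```r
library(dplyr)

# Hypothetical fares, including the boundary value 70 and a missing value.
toy_df <- data.frame(Fare = c(7.25, 71.28, 70, NA))

toy_mutated <- mutate(
  toy_df,
  FareScaled = Fare * 1.1,                             # new numeric column
  FareCategorical = ifelse(Fare < 70, 'low', 'high')   # a missing Fare stays NA
)

print(toy_mutated)
```

A fare of exactly 70 is labelled 'high' because the condition uses a strict less-than.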
Using pipes
Pipes let you chain multiple steps in one go. Here, we first filter the dataset to keep only the females. Next, we group these females by passenger class (the Pclass variable). The last step creates a variable called mean_age that holds the mean age within each class.
pipe_1 <- train_df %>%
filter(Sex == "female") %>%
group_by(Pclass) %>%
summarize(mean_age = mean(Age, na.rm = TRUE))
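To show what the pipe is doing, the sketch below runs the same steps twice on made-up data: once with pipes and once as nested function calls. It assumes dplyr is installed; `toy_df`, `piped`, and `nested` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data, not the workshop's train.csv.
toy_df <- data.frame(
  Sex    = c("female", "female", "female", "female", "male"),
  Pclass = c(1, 1, 3, 3, 3),
  Age    = c(30, 40, NA, 20, 22)
)

# With pipes: read the steps top to bottom.
piped <- toy_df %>%
  filter(Sex == "female") %>%
  group_by(Pclass) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE))

# Without pipes: the same steps, read from the inside out.
nested <- summarize(
  group_by(filter(toy_df, Sex == "female"), Pclass),
  mean_age = mean(Age, na.rm = TRUE)
)

print(piped)
```

Both versions give one row per class; the piped form is usually easier to read when there are several steps.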
This is a very short section. We will look at how R can present summaries of different variables.
Summarizing a continuous variable
summary(train_df$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 20.12 28.00 29.70 38.00 80.00 177
# Try running this: it returns NA instead of a number.
mean(train_df$Age)
[1] NA
# Why? Age has missing values. Solution: remove them with na.rm = TRUE.
mean(train_df$Age, na.rm=TRUE)
[1] 29.69912
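Before removing missing values, it helps to count them. A minimal base R sketch with made-up ages:

```r
ages <- c(22, 38, NA, 35, NA)  # hypothetical values

n_missing     <- sum(is.na(ages))          # number of missing values
share_missing <- mean(is.na(ages))         # share of missing values
mean_observed <- mean(ages, na.rm = TRUE)  # mean of the observed values only
mean_by_hand  <- mean(ages[!is.na(ages)])  # same result, dropping NAs by hand

n_missing
share_missing
mean_observed
```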
Summarizing a categorical variable
table(train_df$Sex)
female male
314 577
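Counts are often easier to read as percentages. This sketch re-enters the two counts printed above by hand, so it does not need the data file; the names `counts` and `percents` are made up.

```r
# Counts copied from the table() output above.
counts <- c(female = 314, male = 577)

# prop.table() divides each count by the total.
percents <- round(100 * prop.table(counts), 1)
percents
```

With the data loaded, `prop.table(table(train_df$Sex))` should give the same proportions directly.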
ggplot
ggplot2 is one of the most popular R packages for data visualization. If you want to learn more about ggplot2, here is a link for you.
library(ggplot2)
package ‘ggplot2’ was built under R version 3.6.3
ggplot(data = train_df,
mapping = aes(x = PassengerId, y = Fare))
Why is the graph empty? We specified that the PassengerId variable should be mapped to the x-axis and that the Fare should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph.
Geoms to the rescue!
library(ggplot2)
ggplot(data = train_df,
mapping = aes(x = PassengerId, y = Fare))+
geom_point()
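The idea that a plot stays empty until a geom is added can be checked on the plot object itself. This is a sketch that assumes ggplot2 is installed; `toy_df`, `base_plot`, and `with_points` are hypothetical names.

```r
library(ggplot2)

# Hypothetical toy data.
toy_df <- data.frame(x = 1:5, y = c(2, 4, 3, 8, 6))

# Data and axis mappings only: nothing to draw yet.
base_plot <- ggplot(data = toy_df, mapping = aes(x = x, y = y))
n_layers_before <- length(base_plot$layers)

# Adding a geom adds a layer that tells ggplot2 what to draw.
with_points <- base_plot + geom_point()
n_layers_after <- length(with_points$layers)

n_layers_before
n_layers_after
```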
Let’s make it fancy and fit a line!
library(ggplot2)
ggplot(data = train_df,
mapping = aes(x = PassengerId, y = Fare))+
geom_point(color = "cornflowerblue",
alpha = .7,
size = 3)+
geom_smooth(method = "lm")
Histograms
library(scales)
package ‘scales’ was built under R version 3.6.3
ggplot(train_df,
aes(x = Age,
y = after_stat(count / sum(count)))) +
geom_histogram(fill = "Blue",
color = "white",
binwidth = 5) +
labs(title="Travellers by age",
y = "Percent",
x = "Age") +
scale_y_continuous(labels = percent)
Categorical data
ggplot(train_df, aes(x = Sex)) +
geom_bar(fill = "cornflowerblue",
color="black") +
labs(x = "Sex",
y = "Frequency",
title = "Travellers by Sex")+
coord_flip()
Pie chart
# create a pie chart with slice labels
plotdata <- train_df %>%
count(Pclass) %>%
arrange(desc(Pclass)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$Pclass, "\n",
round(plotdata$prop), "%")
ggplot(plotdata,
aes(x = "",
y = prop,
fill = Pclass)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
geom_text(aes(y = lab.ypos, label = label),
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Participants by class")
Tree map
library(treemapify)
package ‘treemapify’ was built under R version 3.6.3
# create a treemap of passenger counts by class
plotdata <- train_df %>%
count(Pclass)
ggplot(plotdata,
aes(fill = Pclass,
area = n,
label = Pclass)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre") +
labs(title = "Training data by class") +
theme(legend.position = "none")
Visualizing 3 variables
# plot fare histograms by Pclass
ggplot(train_df, aes(x = Fare)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
facet_wrap(~Pclass, ncol = 1) +
labs(title = "Fare histograms by Pclass")
# plot fare histograms by sex and pclass
ggplot(train_df, aes(x = Fare)) +
geom_histogram(color = "white",
fill = "cornflowerblue") +
facet_grid(Sex ~ Pclass) +
labs(title = "Fare histograms by sex and pclass",
x = "Fare")
Miscellaneous graphs
library(ggplot2)
library(plotly)
package ‘plotly’ was built under R version 3.6.3
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
p <- ggplot(train_df, aes(x=PassengerId,
y=Fare,
color=Pclass)) +
geom_point(size=3) +
labs(x = "Passenger Id",
y = "Fare",
color = "Passenger Class") +
theme_bw()
ggplotly(p)
Interesting video that beautifully explains what a neural network is… link
# load library
require(neuralnet)
Loading required package: neuralnet
package ‘neuralnet’ was built under R version 3.6.3
Attaching package: ‘neuralnet’
The following object is masked from ‘package:dplyr’:
compute
m <- model.matrix(
~ Survived + Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train_df
)
head(m)
(Intercept) Survived Pclass Sexmale Age SibSp Parch Fare EmbarkedC
1 1 0 3 1 22 1 0 7.2500 0
2 1 1 1 0 38 1 0 71.2833 1
3 1 1 3 0 26 0 0 7.9250 0
4 1 1 1 0 35 1 0 53.1000 0
5 1 0 3 1 35 0 0 8.0500 0
7 1 0 1 1 54 0 0 51.8625 0
EmbarkedQ EmbarkedS
1 0 1
2 0 0
3 0 1
4 0 1
5 0 1
7 0 1
library(neuralnet)
r <- neuralnet(
Survived ~ Pclass + Sexmale + Age + SibSp + Parch + Fare + EmbarkedC + EmbarkedQ + EmbarkedS,
data=m, hidden=10, threshold=0.5
)
plot(r)
m2 <- model.matrix(
~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = test_df
)
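# compute the network output for the test data
Predict <- compute(r, m2)
prob <- Predict$net.result
# classify as survived (1) when the output exceeds 0.5, otherwise 0
pred <- ifelse(prob > 0.5, 1, 0)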
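The thresholding step can be illustrated in base R without training a network. Everything in this sketch is hypothetical: the values in `toy_prob` and the labels in `toy_actual` are made up, and comparing against true labels assumes they are available for the rows being predicted, which may not be the case for a test file.

```r
# Hypothetical network outputs and hypothetical true labels.
toy_prob   <- c(0.91, 0.12, 0.55, 0.40)
toy_actual <- c(1, 0, 0, 1)

# Same rule as above: predict 1 when the output exceeds 0.5.
toy_pred <- ifelse(toy_prob > 0.5, 1, 0)

# Cross-tabulate predictions against the labels.
table(predicted = toy_pred, actual = toy_actual)

# Share of rows where prediction and label agree.
toy_accuracy <- mean(toy_pred == toy_actual)
toy_accuracy
```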
PCA is a well-known dimensionality reduction technique. It finds the components (linear combinations of your variables) that capture the most variation in your predictor variables; the response variable is not used. In other words, we are interested in finding the directions in the data that are mathematically the richest in terms of information.
Note: PCA works on numeric variables only.
library(tidyr)
train_pca_df <- select(train_df, Age, SibSp, Parch, Fare)
train_pca_df <- train_pca_df %>% drop_na()
pca_result <- prcomp(train_pca_df, center = TRUE,scale. = TRUE)
summary(pca_result)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.2794 1.0522 0.8182 0.7659
Proportion of Variance 0.4092 0.2768 0.1673 0.1467
Cumulative Proportion 0.4092 0.6860 0.8533 1.0000
Interpretation:
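The columns PC1 to PC4 are principal components, that is, linear combinations of Age, SibSp, Parch, and Fare. They are not the original variables, so PC1 is not the same thing as Age. PC1 explains about 41% of the total variation in the given dataset, and the first two components together explain about 69%. To decide which variables to keep when using other machine learning algorithms, look at how strongly each variable contributes to the leading components, which is stored in pca_result$rotation.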
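The "Proportion of Variance" row can be reproduced from the standard deviations, and the loadings show how each variable contributes to a component. This is a sketch only: the standard deviations are copied by hand from the summary above, and the loadings are shown on R's built-in USArrests data as a stand-in, since the workshop file is not loaded here.

```r
# Standard deviations copied from the summary() output above.
sdev <- c(PC1 = 1.2794, PC2 = 1.0522, PC3 = 0.8182, PC4 = 0.7659)

# Variance of each component divided by the total variance.
prop_var <- sdev^2 / sum(sdev^2)
round(prop_var, 3)
cumsum(prop_var)

# Loadings on a built-in dataset (a stand-in, not the Titanic data).
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca_toy$rotation  # rows: original variables, columns: components
```

Up to rounding, `prop_var` should match the proportions reported by `summary(pca_result)`.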